Abstract

In this theoretical paper we are concerned with the problem of learning a 
value function by a smooth general function approximator, to solve a de- 
terministic episodic control problem in a large continuous state space. It is 
shown that learning the gradient of the value-function at every point along 
a trajectory generated by a greedy policy is a sufficient condition for the 
trajectory to be locally extremal, and often locally optimal, and we argue 
that this brings greater efficiency to value-function learning. This contrasts
with traditional value-function learning, in which the value-function must be
learnt over the whole of state space.

It is also proven that policy-gradient learning applied to a greedy policy 
on a value-function produces a weight update equivalent to a value-gradient 
weight update, which provides a surprising connection between these two 
alternative paradigms of reinforcement learning, and a convergence proof 
for control problems with a value function represented by a general smooth 
function approximator. 
1. Introduction

Reinforcement learning (RL) is the study of how an agent can learn actions
that maximise some given reward function. For example, a typical scenario is
a robot (or agent) wandering around in an environment, such that at time $t$
it has position (or state) vector $x_t$. The robot moves in continuous space
and time, but we discretize time to enable modelling of the motion by a
computer. Thus at each time $t$ the robot chooses an action $a_t$ which takes
it to the next state according to the environment's model function
$x_{t+1} = f(x_t, a_t)$, and gives it an immediate scalar reward $r_t$, given
by the reward function $r_t = r(x_t, a_t)$. The robot keeps moving until it
reaches one of the designated terminal states. The RL problem is for the robot
to learn how to choose actions so as to maximise the total reward received,
$\sum_t r_t$. Specifically, the problem is to find a policy function
$\pi(x, z)$ (where $z$ is some parameter vector) that calculates which action
$a = \pi(x, z)$ to take for any given state $x$ such that the total reward is
maximised.

One key approach to tackle this RL problem is to assign a score to every 
point in state space that gives the best possible total reward attainable if 
starting from that state. This scoring function is called the optimal value 
function, $V^*(x)$. If this function were perfectly known then it would be easy
for the robot to behave optimally, because at any instant it could consider
all possible actions available and always choose the one that leads to the
best valued state, whilst also taking into account the immediate short-term
reward in getting there. This way of acting is called the greedy policy on
$V^*$. So the objective of learning is to make a function approximator, $V(x, w)$
(e.g. a neural network with weight vector $w$), learn and represent the optimal
value function, and then use a greedy policy on the approximated function.

However, the optimal value function is not known at the start of learning.
So for any given policy $\pi(x, z)$ we can define its value function $V^\pi(x, z)$ to be
the real-valued total reward that would be encountered if the robot started
at state $x$ and followed that policy until termination. Bellman's Optimality
Condition [1] shows that if $V = V^\pi$ for all $x$ in the state space $S$, where
$\pi$ is the greedy policy on $V$, then that greedy policy is optimal. There is
a circular interdependence here: $V^\pi$ depends on the greedy policy $\pi$, which
depends on $V$, and we want $V = V^\pi$ for all $x$.

If the state space was discrete and finite then Bellman's condition could 
be met by dynamic programming which makes iterative sweeps through the 
whole of state space, updating V incrementally. But in our problem the 
state space is large and continuous so this is not possible. The RL methods
TD(0) and Q-learning [?], [?] can be used to update $V$ along one trajectory
at a time, but these can be very slow since Bellman's condition still needs 
meeting over the entire state space for optimality. Even if Bellman's condition 
is perfectly satisfied along a single trajectory, performance can be extremely 
far from optimal if Bellman's condition is not satisfied over the neighbouring 
trajectories too. Hence it is well known in the RL community that constant 
exploration of the environment must be applied. This exploration could be 
provided by stochastic model functions, a stochastic policy, or a stochastic 
start point for each trajectory. The ability of RL algorithms to work in 
stochastic environments is a virtue, but it is also a necessity for the above 
reason, and it is a goal of this paper to define value-function learning methods 
that work in a deterministic environment. 

TD($\lambda$) [3] is a generalization of TD(0) which uses an extra parameter
$\lambda \in [0,1]$ that can improve the speed of learning. The effect of $\lambda$ is described
in detail in Section 2.2, where we call it the "bootstrapping" parameter.

Although value function learning methods have produced successes in
robot control [?], [6], they are problematic in that their theoretical
convergence guarantees with function approximators are limited. TD($\lambda$) has
been proven to converge [?] provided the function approximator for $V(x,w)$
is linear in $w$, and the policy is fixed (i.e. that excludes the greedy
policy on $V$). These methods are not proven to converge when a general
function approximator is used to represent the value function (e.g. a neural
network) or when a greedy policy is used, such as is required by our robot RL
problem. Divergence examples exist for a non-linear function approximator
[7], and where $V$ is linear in $w$ but where a greedy policy is used (diverging
for both $\lambda = 0$ and $\lambda = 1$; see section 4.3 of [8]).

One reason that these methods do not always converge is that changing
the approximated value function $V$ at one point in state space will cause $V$
to change at other points of state space too, since the function approximator
that represents $V(x, w)$ cannot be infinitely flexible. A second reason is that
in the Bellman condition, $V^\pi$ depends on $\pi$ which in turn depends on $V$, so
making progress in learning one of them can undo progress in learning the
other. This second issue is highly relevant for RL control problems, since
the ultimate objective is not just to learn a value function for some fixed
policy, but is to improve a policy until it becomes optimal (or close enough
to optimal). Thus any successful convergence analysis for value function
learning must cope with the concurrent updating of $V$ and the greedy policy,
and there have been few insights into this problem in the RL literature:
convergence proofs so far have generally treated one of these two components
as fixed, or only treated the tabular case.

We address these issues by following the method of Dual Heuristic
Programming (DHP) of Werbos [9], which tries to explicitly learn the gradient
of the value function with respect to the state vector, i.e. it learns
$\partial V / \partial x$ instead of $V$ directly. We call this method value gradient
learning (VGL), to distinguish it from the usual direct updates to the values
of the value function, which we refer to as value learning methods (VL). We
extend Werbos' method to include a bootstrapping parameter $\lambda$ (just as Sutton
did in extending TD(0) into TD($\lambda$)), to give the algorithm we call VGL($\lambda$).

The VGL method addresses the issue of the Bellman equation needing 
to be solved over the whole of state space, in that it turns out to be only 
necessary to learn the value gradient along a single trajectory for it to be 
locally optimal. This contrasts strongly with the VL methods which need to 
learn the value function over all immediately neighbouring trajectories too 
for local optimality, and so this is a significant efficiency gain for the VGL 
method. This optimality is an almost-immediate consequence of Pontryagin's 
maximum principle [10], and this is proven in Section 3.

We address the difficulty of analysing the interdependence of simultaneously
updating $V$ and $\pi$ by showing (in Lemma 7) that the dependency of
a greedy policy on a value function is primarily through the value-gradient.
Hence a value-gradient analysis is necessary at some level to provide a theo- 
retical gateway to analysing the convergence properties of any value-function 
weight update that uses a greedy policy; be it a VGL weight update or a VL 
weight update. 

The dependency of the greedy policy on the value-gradient has already
been exploited in an efficient policy [?], but the VGL method takes this one
step further by trying to explicitly learn the value-gradient.

There is an alternative paradigm of RL called policy gradient learning 
(PGL) which does not rely on learning a value function at all. We define PGL 
as algorithms that do gradient ascent on the total reward, and this definition
includes methods of [12], [13], [14]. These methods have natural convergence
guarantees since they are hill climbing strategies on a function with an upper
bound, and have proved successful at robot control in continuous spaces [13].

In Section 4 we show a VGL weight update with $\lambda = 1$ is identical
to a PGL weight update, and this makes a theoretical connection between
these two different paradigms of RL, and provides a convergence proof for
this value function control problem with a general function approximator
(provided $\lambda = 1$ and provided the policy is differentiable at every time step
of the trajectory).

In summary, the VGL methods in this paper lead to several benefits and 
insights: 

• They make a more direct approach to RL since it is the value-gradient
that affects the policy and so it makes sense that this is what should
be learned.

• It is only necessary to learn the value-gradient along a single trajectory
instead of the whole of state space. This can lead to improved efficiency
for VGL methods, since there is no need to explore locally (see Section 3).

• They provide a theoretical insight and convergence result into the long- 
standing problem of proving convergence for value-function learning 
methods with a function approximator, while providing a theoretical 
link between value-function learning and PGL. 

Another goal of this paper is to raise awareness in the RL community of
the methods of [?].

The VGL method is a "model based" RL method in that it requires
the model functions $f(x, a)$ and $r(x, a)$ to be known. Knowledge of the
model functions (and their derivatives) delivers many of the above benefits
of the VGL method. Many researchers define RL specifically for the case
of unknown model functions. To answer this we would have to supplement
this VGL method with a separate learning system specifically to learn the
model functions, prior to trying to learn the policy. This is a commonly
used strategy for successful RL methods; for example, in the recent success
of maintaining the inverted flight of a helicopter [15], the model functions
were learned separately and prior to learning a policy. We also suggest that
model learning is the relatively straightforward part of the RL task, since
it is a supervised learning problem, where the immediate answers are given.
Also the model functions in a control problem are often simple known laws
of physics, and they do not change much from point to point, due to the
continuous nature of the environment. [?] exploits this to successfully learn
the model functions, in real time, entirely while the robot is travelling along
one single trajectory. However we acknowledge that ideally, for a full RL
learning system, we would concurrently learn the model, the policy, and the
value function while still ensuring convergence. In this paper we successfully
analyse items two and three of this list in the case of $\lambda = 1$, but at the
expense of assuming the first is fixed and known.

We note that a third paradigm of RL exists called an actor-critic archi- 
tecture. In this architecture there is one function approximator to represent 
$V$, and a second function approximator to represent the policy. The greedy
policy is not used. Successful theoretical results exist for the concurrent
updating of the policy and $V$ [16], and we discuss these results in comparison
to our own in Section 4.1.

This paper is organised as follows. Section 2 defines the VGL($\lambda$) algorithm
and gives the definitions necessary to do this. Key concepts defined there are
the approximated value function and its target values; and the approximated
value gradient and its target. The next two sections give the two main
theoretical results: In Section 3 we prove the local optimality of the objective
of VGL for a single trajectory and discuss the efficiency of the method. In
Section 4 we demonstrate the connection between VGL and PGL, and hence
give the convergence proof for VGL with $\lambda = 1$, under certain conditions.
Short discussions follow each of the two main theoretical results. Section 5
presents the conclusions of our work.



2. Value Gradient Learning for Reinforcement Learning 

This section defines the VGL algorithm. After some preliminary definitions
are made in Section 2.1, we describe target values which can be used
to define VL (in Section 2.2). This definition of target values and VL is
done in a concise way that differs from the conventional RL literature, and it
allows us to define the VGL targets (in Section 2.4) and the VGL algorithm
(Section 2.5). Both of these VGL concepts will be new to readers only
experienced with VL. A technical difficulty that needs dealing with on the way
is that of saturated actions, which are defined in Section 2.3.

2.1. Preliminary Definitions 

State space, trajectories and model functions. State space, $S$, is a
subset of $\mathbb{R}^n$. Each state in the state space is denoted by a column vector
$x$. The state space is large and continuous so that a function approximator
is necessary to represent the learned policy. A trajectory is a list of states
$(x_0, x_1, \ldots, x_F)$ through state space starting at a given point $x_0$. The
trajectory is parameterised by actions $a_t$, chosen from an action space $A$, for time
steps $t$ according to a model. The model comprises two known smooth
deterministic functions $f(x,a)$ and $r(x,a)$. The first model function $f$ links
one state in the trajectory to the next, given action $a_t$, via the "Markovian"
rule $x_{t+1} = f(x_t, a_t)$. The second model function, $r$, gives an immediate
real-valued reward $r_t = r(x_t, a_t)$ on arriving at the next state $x_{t+1}$.

Some states in S are designated as terminal states. Assume that each 
trajectory is guaranteed to reach a terminal state in some finite time (i.e. the 
problem is episodic). For example, a scenario like this could be an aircraft 
with limited fuel trying to land; or it could be a navigation problem with 
an imposed time limit. For a particular trajectory, label the final time step
$t = F$, so that $x_F$ is the terminal state of that trajectory. Note that for a
general trajectory, $F$ is dependent on the start point and the actions taken,
so is not a global constant.

Action vectors. The action vectors $a$ can be real scalars or have several
real components, one for each of the control dimensions of the agent. For
example in a car, the action components might be accelerator pedal position,
steering wheel angle, and brake pedal position. For a monorail train, there
might be just one scalar needed. Assume each action component $(a_t)^i$ is a
real number that, for some problems, may be constrained to $(a_t)^i \in [-1, 1]$,
and these constraints are imposed by the action space, such that $a_t \in A$ for
all $t$.^1

Policy. A policy is a function $\pi(x,z)$, parameterised by a vector $z$, that
generates actions as a function of state. Thus for a given trajectory generated
by a given policy $\pi$, $a_t = \pi(x_t, z)$. Since the policy is a pure function of $x$ and
$z$, the policy is memoryless. The vector $z$ holds the parameters of a smooth
function approximator, for example it could be a concatenation in column
vector form of all of the weights of a neural network.

Value Function. If a trajectory starts at state $x_0$ and then follows a policy
$\pi(x,z)$ until reaching a terminal state, then the value function $V^\pi(x_0, z)$
returns the total reward received:

$$V^\pi(x_0, z) = \sum_{t=0}^{F-1} r(x_t, \pi(x_t, z)) = r(x_0, \pi(x_0, z)) + V^\pi(f(x_0, \pi(x_0, z)), z) \tag{1}$$

with $V^\pi(x_F, z) = 0$.

^1 The choice of the range $[-1, 1]$ is arbitrary. The theoretical results of this paper would
also apply to any other finite range.
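As a concrete illustration of the value function definition, the sketch below rolls a trajectory forward under a fixed policy and sums the rewards. The model functions `f` and `r`, the policy `pi`, and the time-limit terminal condition are hypothetical stand-ins chosen only so that the example runs; they are not taken from the paper.

```python
def f(x, a):
    """Hypothetical model function: x = (position, step); the action shifts position."""
    p, k = x
    return (p + 0.5 * a, k + 1)


def r(x, a):
    """Hypothetical reward: penalise distance from the origin and control effort."""
    p, _ = x
    return -(p * p) - 0.1 * a * a


def is_terminal(x, time_limit=3):
    """Assumed terminal condition: an imposed time limit, so the problem is episodic."""
    return x[1] >= time_limit


def pi(x, z):
    """Hypothetical policy with a single parameter z, clipped to the range [-1, 1]."""
    return max(-1.0, min(1.0, -z * x[0]))


def value(x0, z):
    """Total reward received by following pi from x0 until a terminal state."""
    total, x = 0.0, x0
    while not is_terminal(x):
        a = pi(x, z)
        total += r(x, a)
        x = f(x, a)
    return total


x0, z = (1.0, 0), 0.8
a0 = pi(x0, z)
# Recursive form of the definition: V(x0) = r(x0, a0) + V(f(x0, a0)).
lhs = value(x0, z)
rhs = r(x0, a0) + value(f(x0, a0), z)
```

The two quantities `lhs` and `rhs` agree up to rounding, and the value of any terminal state is zero because the loop body never runs.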

Approximate Value Function. We define $V(x, w)$ to be the real-valued
output of a smooth function approximator with weight vector $w$ and input
vector $x$. This is the approximate value function. It is the intention of most
VL algorithms to eventually find $w$ such that $V(x,w) = V^*(x)$ for all $x$, as
described earlier.

Approximate Value-Gradient. The approximate value-gradient function
$G(x,w)$ is defined to be $G(x,w) = \frac{\partial V(x,w)}{\partial x}$ and this is what the VGL
algorithms learn. Since $V(x,w)$ is defined to be smooth, the approximate
value-gradient always exists.

Greedy Policy. We define a greedy policy $\pi(x, w)$ on the approximate value
function $V(x, w)$ by:

$$\pi(x, w) = \arg\max_{a \in A} \big( r(x, a) + V(f(x, a), w) \big) \tag{2}$$

The greedy policy is a one-step look-ahead that decides which action
to take, based only on the model and $V$. Since for a greedy policy, the
actions are dependent on $V(x,w)$ and state, it follows that $\pi = \pi(x,w)$.
This dependency on $w$ distinguishes how we notate the greedy policy (i.e.
$\pi(x,w)$) from a general (non-greedy) policy (i.e. $\pi(x,z)$). We extend the
definition of the value function $V^\pi(x, z)$ to also apply to the greedy policy,
and we write this as $V^\pi(x, w)$.

Greedy Trajectory. A greedy trajectory is one that has been generated by 
the greedy policy. Since the greedy policy depends upon the same weight 
vector as $V(x,w)$, any modification to the weight vector $w$ will immediately
both change $V(x, w)$ and move all greedy trajectories. Hence we say $V$ and
the greedy policy are tightly coupled; it is the same weight vector $w$ used in
$V(x, w)$ and the greedy policy $\pi(x, w)$.

Trajectory Shorthand Notation. For a given trajectory through states
$(x_0, x_1, \ldots, x_F)$ with actions $(a_0, a_1, \ldots, a_{F-1})$, and for any function defined
on state space (e.g. including $V(x,w)$, $G(x,w)$, $V^\pi(x,w)$, $r(x,a)$ and the
functions defined later in the paper) we use a subscript of $t$ on the function
to indicate that the function is being evaluated at $(x_t, a_t, w)$. For example,
$r_t = r(x_t, a_t)$, $G_t = G(x_t, w)$ and $(V^\pi)_t = V^\pi(x_t, w)$. Note that this shorthand
does not mean that these functions are functions of $t$.

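Trajectory Shorthand Notation for Partial Derivatives. We use
brackets with a subscripted $t$ to indicate that a partial derivative is to be
evaluated at time step $t$ of a trajectory. For example, $\left(\frac{\partial V}{\partial w}\right)_t$ is shorthand
for $\left.\frac{\partial V}{\partial w}\right|_{(x_t, w)}$, i.e. the function $\frac{\partial V}{\partial w}$ evaluated at $(x_t, w)$. Also, for example,
$\left(\frac{\partial f}{\partial a}\right)_t = \left.\frac{\partial f}{\partial a}\right|_{(x_t, a_t)}$, and similarly for other partial derivatives including $\left(\frac{\partial r}{\partial x}\right)_t$.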

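Matrix-vector notation. Throughout this paper, a convention is used that
all defined vector quantities are columns, whether they are coordinates, or
derivatives with respect to coordinates. Also any vector becomes transposed
(becoming a row) if it appears in the numerator of a differential. Upper
indices indicate the component of a vector or matrix. For example, $x_t$ is a
column; $w$ is a column; $G_t$ is a column; $\left(\frac{\partial V}{\partial w}\right)_t$ is a column; $\left(\frac{\partial f}{\partial x}\right)_t$ is a matrix
with element $(i,j)$ equal to $\left(\frac{\partial f(x,a)^j}{\partial x^i}\right)_t$; $\left(\frac{\partial G}{\partial w}\right)_t$ is a matrix with element $(i,j)$
equal to $\left(\frac{\partial G^j}{\partial w^i}\right)_t$. An example product is $\left(\frac{\partial f}{\partial x}\right)_t G_{t+1} = \sum_i \left(\frac{\partial f^i}{\partial x}\right)_t G^i_{t+1}$. An
example second derivative of a scalar is $\left(\frac{\partial^2 V}{\partial w \partial x}\right)_t = \left(\frac{\partial}{\partial w}\left(\frac{\partial V}{\partial x}\right)\right)_t = \left(\frac{\partial G}{\partial w}\right)_t$.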

Approximate Q Value function. Since the quantity in the right-hand side
of the greedy policy (Eq. 2) comes up often, we define a function specifically
for it. The approximate Q Value function [?] is defined as

$$Q(x,a,w) = r(x,a) + V(f(x,a),w) \tag{3}$$

The greedy policy therefore maximises this quantity, i.e. the greedy policy
is such that $\pi(x, w) = \arg\max_{a \in A} Q(x, a, w)$.
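To make the one-step look-ahead concrete, the following sketch evaluates the approximate Q value and picks the greedy action by a brute-force search over a grid of actions in $[-1, 1]$. The scalar model `f`, reward `r` and approximate value function `V` are hypothetical examples, and the grid search is only one simple way to approximate the argmax; it is not the method used in the paper.

```python
def f(x, a):
    return x + 0.5 * a          # hypothetical scalar model function


def r(x, a):
    return -0.1 * a * a         # hypothetical reward: penalise control effort


def V(x, w):
    return -w * x * x           # hypothetical approximate value function


def Q(x, a, w):
    """Approximate Q value: immediate reward plus approximate value of the next state."""
    return r(x, a) + V(f(x, a), w)


def greedy(x, w, n_grid=2001):
    """Greedy action, approximated by searching a uniform grid on [-1, 1]."""
    actions = [-1.0 + 2.0 * i / (n_grid - 1) for i in range(n_grid)]
    return max(actions, key=lambda a: Q(x, a, w))


a_interior = greedy(0.2, 1.0)   # the maximum of Q lies inside the action range
a_boundary = greedy(1.0, 1.0)   # the unconstrained maximum lies below -1
```

For this quadratic example the interior maximiser is $a = -wx/(0.2 + 0.5w)$, so the first call returns a grid point next to $-2/7$, while the second call is pushed onto the boundary $a = -1$.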

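We will often also need the derivative $\frac{\partial Q}{\partial a}$, which is

$$\left(\frac{\partial Q}{\partial a}\right)_t = \left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t G_{t+1} \tag{4}$$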
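The expression for $\partial Q / \partial a$ follows from the chain rule applied to $Q = r + V(f)$. The sketch below checks it numerically in the scalar case against a central finite difference; the model, reward and value function are hypothetical quadratic examples.

```python
def f(x, a):
    return x + 0.5 * a              # hypothetical scalar model function


def r(x, a):
    return -0.1 * a * a - x * x     # hypothetical reward


def V(x, w):
    return -w * x * x               # hypothetical approximate value function


def G(x, w):
    return -2.0 * w * x             # its value-gradient dV/dx


def Q(x, a, w):
    return r(x, a) + V(f(x, a), w)


def dQ_da(x, a, w):
    """Scalar chain rule: dQ/da = dr/da + (df/da) * G evaluated at the next state."""
    dr_da = -0.2 * a
    df_da = 0.5
    return dr_da + df_da * G(f(x, a), w)


x, a, w, eps = 0.7, -0.3, 1.5, 1e-6
numeric = (Q(x, a + eps, w) - Q(x, a - eps, w)) / (2 * eps)
analytic = dQ_da(x, a, w)
```

Because `Q` is quadratic in `a` here, the central difference matches the analytic derivative up to floating-point rounding.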

2.2. Target Values 

These are useful concepts for understanding value-learning, which we need 
to define here because we will later differentiate them to make the targets 
for the VGL algorithm.

For a trajectory found by a greedy policy $\pi(x,w)$ on $V(x,w)$, we define
the "target-value function" $V'(x,w)$ recursively as

$$V'(x, w) = r(x, \pi(x, w)) + \big(\lambda V'(f(x, \pi(x, w)), w) + (1 - \lambda) V(f(x, \pi(x, w)), w)\big) \tag{5a}$$

with $V'(x_F, w) = 0$ and where $\lambda \in [0, 1]$ is a fixed global constant (described
further below). To calculate $V'$ for a particular point $x_0$ in state space, it
is necessary to run and cache a whole trajectory starting from $x_0$ under the
greedy policy $\pi(x, w)$, and then work backwards along it applying the above
recursion; thus $V'(x, w)$ is defined for all points in state space.

For a given trajectory, using shorthand notation the above equation
simplifies to

$$V'_t = r_t + \big(\lambda V'_{t+1} + (1 - \lambda) V_{t+1}\big) \tag{5b}$$
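The recursion of Eq. 5b can be sketched as a single backward pass over a cached trajectory. The function name `target_values`, the list layout, and the numbers in the example are illustrative assumptions; `lam` stands for the bootstrapping parameter.

```python
def target_values(rewards, values, lam):
    """Backward recursion for the target values.

    rewards: [r_0, ..., r_{F-1}]
    values:  [V_0, ..., V_F], the approximator outputs along the trajectory
    Returns [V'_0, ..., V'_F], with V'_F = 0 at the terminal state.
    """
    F = len(rewards)
    targets = [0.0] * (F + 1)
    for t in range(F - 1, -1, -1):
        targets[t] = rewards[t] + lam * targets[t + 1] + (1.0 - lam) * values[t + 1]
    return targets


rewards = [1.0, 2.0, 3.0]
values = [0.5, 0.4, 0.3, 0.0]
full_return = target_values(rewards, values, lam=1.0)   # sums of rewards-to-go
one_step = target_values(rewards, values, lam=0.0)      # r_t + V_{t+1}
```

With `lam=1.0` the targets ignore the approximator and reduce to the reward received from each step onwards; with `lam=0.0` each target depends only on the immediate reward and the next estimate.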

We refer to the values $V'_t$ simply as the "target values" since the objective
of VL is to make $V_t$ equal to $V'_t$ for all $t$ and along all greedy trajectories,
since then:

$$\begin{aligned}
& V_t = V'_t && \forall x_0, \forall t \\
\Rightarrow\ & V_t = r_t + V_{t+1} && \forall x_0, \forall t \quad \text{by Eq. 5b} \\
\Leftrightarrow\ & V(x,w) = r(x, \pi(x,w)) + V(f(x, \pi(x,w)), w) && \forall x \\
\Leftrightarrow\ & V(x,w) = V^\pi(x, w) && \forall x \quad \text{by Eq. 1}
\end{aligned}$$

So when coupled with the greedy policy (Eq. 2), $V$ satisfies Bellman's
Optimality Condition, and so the greedy policy will be an optimal policy.

We point out that since $V'$ is dependent on the actions and on $V(x,w)$,
it is not a simple matter to attain the objective $V = V'$, since changing $V$
infinitesimally will immediately move the greedy trajectories (since they are
tightly coupled), and therefore change $V'$; these targets are moving ones. So
we should only try to move the values $V_t$ slowly towards their targets. For
example a VL function approximator weight update to do this could be:

$$\Delta w = \alpha \sum_{t \ge 0} \left(\frac{\partial V}{\partial w}\right)_t (V'_t - V_t) \tag{6}$$

where $\alpha$ is a small positive constant.
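A minimal sketch of this VL weight update is given below for a linear approximator $V(x, w) = w \cdot \phi(x)$, for which the weight-derivative of $V$ at step $t$ is just the feature vector. The feature lists, the function name `vl_update`, and the choice to sum over the non-terminal steps $t = 0, \ldots, F-1$ are assumptions made for illustration; the trajectory is taken as given rather than regenerated by the greedy policy.

```python
def vl_update(features, rewards, w, lam, alpha):
    """One batch weight update for a linear approximator V(x, w) = w . phi(x).

    features: [phi(x_0), ..., phi(x_F)]; rewards: [r_0, ..., r_{F-1}].
    Returns the weight change dw (same length as w).
    """
    F = len(rewards)
    V = [sum(wi * p for wi, p in zip(w, phi)) for phi in features]
    # Target values from the backward recursion, with V'_F = 0.
    target = [0.0] * (F + 1)
    for t in range(F - 1, -1, -1):
        target[t] = rewards[t] + lam * target[t + 1] + (1.0 - lam) * V[t + 1]
    # Accumulate alpha * (dV/dw)_t * (V'_t - V_t) over the trajectory.
    dw = [0.0] * len(w)
    for t in range(F):
        for i, phi_i in enumerate(features[t]):
            dw[i] += alpha * phi_i * (target[t] - V[t])
    return dw


dw = vl_update(features=[[1.0], [1.0], [0.0]], rewards=[1.0, 1.0],
               w=[0.0], lam=1.0, alpha=0.1)
```

In this toy call the targets are 2 and 1 while both estimates are 0, so the single weight is nudged upwards by `0.1 * (2 + 1)`.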
The choice of the constant $\lambda$ can be seen from Eq. 5b to affect the
targets and hence affect learning by Eq. 6. If $\lambda = 0$ then the recursion
in Eq. 5b is not needed, and the weight update will force the estimate
$V_t$ to move towards $r_t + V_{t+1}$, which is itself built on the estimate $V_{t+1}$.
Updating an estimate based on an estimate like this is commonly called
"bootstrapping", and we call $\lambda$ the bootstrapping parameter. If $\lambda = 1$ then
the recursion in Eq. 5b is fully used, giving $V'(x,w) = V^\pi(x,w)$, and the
estimate $V_t$ will be updated to move towards the actual total reward received
until terminating. For other values of $0 < \lambda < 1$ we get a smooth blending
between these two cases, as can be seen by Eq. 5b.



The function $V'$ is identical to the "$\lambda$-Return" [17], as proven in Appendix A,
but the $V'$ recursive formula is more succinct. The weight update of Eq. 6 is
a succinct statement of the TD($\lambda$) algorithm in batch update mode, as also
proven in Appendix A.

The use of $V'$ greatly simplifies the analysis of value functions and
value-gradients. Having it defined in this recursive form allows us to easily
differentiate it to form the target value-gradient, and hence define the VGL
algorithms. It would not be straightforward to define the VGL algorithms using
the traditional formulation of the "$\lambda$-Return" described in Appendix A.



2.3. Saturated Actions 

An extra complication arises if actions are bounded, e.g. if the constraints
$(a_t)^i \in [-1, 1]$ are present for some action components. These need handling
carefully to be able to differentiate the policy function later in this paper. For
example if an action component represents the steering wheel of a car, then
we say that action component is saturated when the steering wheel is rotated
to its full limit in either direction, with pressure being applied against that
limit. Formally, for the greedy policy, when the constraints $(a_t)^i \in [-1,1]$
are present for some action component $(a_t)^i$, we say the action component
is saturated if $|(a_t)^i| = 1$ and $\left(\frac{\partial Q}{\partial a^i}\right)_t \neq 0$. If either of these conditions is not
met, or the constraints are not present, then the action component $(a_t)^i$ is
not saturated.

We sometimes want to refer to just the unsaturated components of action
vector $a$. We use the notation $u(a)$ to denote a vector of just the
unsaturated components of $a$, i.e. this vector often has lower dimension than $a$
as any saturated components have been removed. If there are no saturated
components then $u(a) = a$.
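The definition of saturation and of the reduced vector of unsaturated components can be sketched as follows. The helper names, the numerical tolerance `tol`, and the example values of the derivative of Q with respect to each action component are hypothetical.

```python
def is_saturated(a_i, dQ_da_i, bounded=True, tol=1e-12):
    """A bounded component is saturated if |a_i| = 1 and dQ/da_i is non-zero."""
    return bounded and abs(abs(a_i) - 1.0) <= tol and abs(dQ_da_i) > tol


def unsaturated(a, dQ_da):
    """Return the vector of only the unsaturated components of the action a."""
    return [a_i for a_i, g_i in zip(a, dQ_da) if not is_saturated(a_i, g_i)]


a = [1.0, 0.25, -1.0]
dQ_da = [0.8, 0.0, 0.0]   # hypothetical derivative values at the greedy action
u = unsaturated(a, dQ_da)
```

Only the first component is removed: it sits at the bound with a non-zero derivative pushing against it. The third component also sits at a bound, but its derivative is zero, so by the definition it is not saturated.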
We note some useful Lemmas about saturated actions and the greedy
policy.^2

Lemma 1. If $(a_t)^i$ is saturated, then, whenever they exist, $\left(\frac{\partial \pi^i}{\partial x}\right)_t = 0$ and $\left(\frac{\partial \pi^i}{\partial w}\right)_t = 0$.

This follows from the definition of saturated action components above: Imagine if the
steering wheel is fully turned to the right, with pressure applied. Because of the pressure
applied, any infinitesimal changes to the circumstances will not change the position of the
steering wheel.

Note that $\left(\frac{\partial \pi}{\partial x}\right)_t$ and $\left(\frac{\partial \pi}{\partial w}\right)_t$ may not exist, for example, if there are multiple joint maxima
in $Q(x, a, w)$ with respect to $a$. Then an infinitesimal change to the Q function could cause
the maximum to flip discontinuously from one of these maxima to the other.

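Lemma 2. If an action component $(a_t)^i$ chosen by a greedy policy is saturated, then
$(a_t)^i = 1 \Rightarrow \left(\frac{\partial Q}{\partial a^i}\right)_t > 0$; and $(a_t)^i = -1 \Rightarrow \left(\frac{\partial Q}{\partial a^i}\right)_t < 0$.

The first of these two implications has to be true since $\left(\frac{\partial Q}{\partial a^i}\right)_t \neq 0$ for a saturated action
by definition, and if $\left(\frac{\partial Q}{\partial a^i}\right)_t < 0$ at $(a_t)^i = 1$ then the maximum of $Q$ would not be at
$(a_t)^i = 1$, which contradicts the greedy policy. The second implication is true for the same
reason with the situation reversed.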



Lemma 3. If an action $a_t$ chosen by a greedy policy has some unsaturated components,
then $\left(\frac{\partial Q}{\partial u(a)}\right)_t = 0$ and $\left(\frac{\partial^2 Q}{\partial u(a) \partial u(a)}\right)_t$ is a negative semi-definite matrix.

The greedy policy has found an action somewhere in the middle of the vector space that
contains $u(a)$. So this is an ordinary local maximum of a surface, hence possesses the claimed
properties.

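Lemma 4. For any action $a_t$ chosen by the greedy policy, regardless of whether any
components are saturated or not, whenever $\left(\frac{\partial \pi}{\partial x}\right)_t$ exists, we have $\left(\frac{\partial \pi}{\partial x}\right)_t \left(\frac{\partial Q}{\partial a}\right)_t = 0$.

Proof. $\left(\frac{\partial \pi}{\partial x}\right)_t \left(\frac{\partial Q}{\partial a}\right)_t = \sum_i \left(\frac{\partial \pi^i}{\partial x}\right)_t \left(\frac{\partial Q}{\partial a^i}\right)_t$. For each term of this sum, the greedy policy
implies either $\left(\frac{\partial \pi^i}{\partial x}\right)_t = 0$ (in the case that action component $(a_t)^i$ is saturated and $\left(\frac{\partial \pi^i}{\partial x}\right)_t$
exists, by Lemma 1), or $\left(\frac{\partial Q}{\partial a^i}\right)_t = 0$ (in the case that $(a_t)^i$ is not saturated, by Lemma 3).
Hence each term of the sum is zero, hence the sum is zero. ■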



^2 These lemmas could be skipped on a first reading and just referred to as needed later
in the paper.
2.4. Target Value-Gradient, ($G'_t$)

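We now define the target vectors $G'_t$ that will be used as the VGL
objective, which is to achieve $G_t = G'_t$ for all $t \ge 0$ along a greedy trajectory. We
first define it for $\lambda \in (0,1]$ and afterwards extend the definition to include
$\lambda = 0$.

For $\lambda \in (0, 1]$, we define the function $G'(x, w) = \frac{\partial V'(x,w)}{\partial x}$, which gives:

$$\begin{aligned}
G'_t &= \left(\frac{\partial}{\partial x}\Big(r(x,\pi(x,w)) + \lambda V'(f(x,\pi(x,w)),w) + (1 - \lambda)V(f(x,\pi(x,w)),w)\Big)\right)_t \quad \text{(by Eq. 5a)} \\
&= \left(\frac{\partial r}{\partial x}\right)_t + \left(\frac{\partial \pi}{\partial x}\right)_t \left(\frac{\partial r}{\partial a}\right)_t
+ \left(\left(\frac{\partial f}{\partial x}\right)_t + \left(\frac{\partial \pi}{\partial x}\right)_t \left(\frac{\partial f}{\partial a}\right)_t\right)\big(\lambda G'_{t+1} + (1 - \lambda) G_{t+1}\big)
\end{aligned} \tag{7}$$

with $G'_F = 0$, and assuming all derivatives in this equation exist (these
existence conditions are discussed further below).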

This recursive formula takes a known target value-gradient at the end
point of a trajectory ($G'_F = 0$), and works it backwards along the trajectory
rotating and incrementing it as appropriate, to give the target value-gradient
at each time step.
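The backward pass described here can be sketched for a scalar state and scalar action, where every derivative in the recursion is a plain number. Everything concrete below is an assumption made for illustration: the model $f(x,a) = x + a$, the reward $r(x,a) = -x^2 - a^2$, a fixed horizon of `F` steps, and a smooth linear policy $\pi(x) = -kx$ standing in for the greedy policy. With the bootstrapping parameter `lam` set to 1 the result at $t = 0$ should equal the derivative of the total reward with respect to the start state, which is checked by finite differences.

```python
def target_gradients(dr_dx, dr_da, df_dx, df_da, dpi_dx, G, lam):
    """Backward recursion for the target value-gradients (scalar state and action).

    Each derivative list holds values for t = 0..F-1; G holds G_0..G_F.
    Returns [G'_0, ..., G'_F] with G'_F = 0.
    """
    F = len(dr_dx)
    Gp = [0.0] * (F + 1)
    for t in range(F - 1, -1, -1):
        blend = lam * Gp[t + 1] + (1.0 - lam) * G[t + 1]
        Gp[t] = (dr_dx[t] + dpi_dx[t] * dr_da[t]
                 + (df_dx[t] + dpi_dx[t] * df_da[t]) * blend)
    return Gp


k, F, x0 = 0.4, 4, 1.3   # hypothetical policy gain, horizon and start state


def total_reward(x):
    """Total reward of the F-step trajectory from x under pi(x) = -k x."""
    total = 0.0
    for _ in range(F):
        a = -k * x
        total += -x * x - a * a
        x = x + a
    return total


# Cache the trajectory, then evaluate the model derivatives along it.
xs = [x0]
for _ in range(F):
    xs.append(xs[-1] - k * xs[-1])
acts = [-k * x for x in xs[:F]]
Gp = target_gradients(
    dr_dx=[-2.0 * x for x in xs[:F]], dr_da=[-2.0 * a for a in acts],
    df_dx=[1.0] * F, df_da=[1.0] * F, dpi_dx=[-k] * F,
    G=[0.0] * (F + 1), lam=1.0)

eps = 1e-6
numeric = (total_reward(x0 + eps) - total_reward(x0 - eps)) / (2 * eps)
```

For vector-valued states the same loop applies with the scalar products replaced by matrix-vector products.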

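For $\lambda = 0$, we modify the above definition slightly to make the definition
independent of the existence of $\left(\frac{\partial \pi}{\partial x}\right)_t$. We first note that in the special case
where $\lambda = 0$, Eq. 7 simplifies as follows:

$$\begin{aligned}
G'_t &= \left(\frac{\partial r}{\partial x}\right)_t + \left(\frac{\partial \pi}{\partial x}\right)_t \left(\frac{\partial r}{\partial a}\right)_t
+ \left(\left(\frac{\partial f}{\partial x}\right)_t + \left(\frac{\partial \pi}{\partial x}\right)_t \left(\frac{\partial f}{\partial a}\right)_t\right) G_{t+1} \\
&= \left(\frac{\partial r}{\partial x}\right)_t + \left(\frac{\partial f}{\partial x}\right)_t G_{t+1} + \left(\frac{\partial \pi}{\partial x}\right)_t \left(\frac{\partial Q}{\partial a}\right)_t \quad \text{by Eq. 4} \\
&= \left(\frac{\partial r}{\partial x}\right)_t + \left(\frac{\partial f}{\partial x}\right)_t G_{t+1} \quad \text{by Lemma 4}
\end{aligned} \tag{8}$$

In this last line there was the assumption that $\left(\frac{\partial \pi}{\partial x}\right)_t$ exists, but for $\lambda = 0$
we simply define $G'_t$ to be equal to this last line. Thus $G'_t$ is always defined
and exists for $\lambda = 0$. This matches the target value-gradient that Werbos
uses in the algorithms DHP and GDHP (Eq. 28 of [18]).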
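When the bootstrapping parameter is zero, the target needs only the model derivatives with respect to the state and the approximate value-gradient at the next state. The sketch below evaluates it for a two-dimensional state using plain lists, storing the derivative of $f$ so that element $(i, j)$ holds the derivative of the $j$-th component of $f$ with respect to the $i$-th component of $x$ (the transposed-numerator convention of the paper). The model and all numbers are hypothetical.

```python
def matvec(M, v):
    """Product M v, where M[i][j] is element (i, j) of the matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def target_gradient_lam0(dr_dx, df_dx, G_next):
    """Target value-gradient for zero bootstrapping parameter:
    G'_t = (dr/dx)_t + (df/dx)_t G_{t+1}.
    """
    return [g + h for g, h in zip(dr_dx, matvec(df_dx, G_next))]


# Hypothetical model: f(x, a) = (x^0 + 0.1 * x^1, x^1 + a).
dr_dx = [-2.0, 0.0]
df_dx = [[1.0, 0.0],    # d f^0 / d x^0, d f^1 / d x^0
         [0.1, 1.0]]    # d f^0 / d x^1, d f^1 / d x^1
G_next = [3.0, -1.0]
Gp_t = target_gradient_lam0(dr_dx, df_dx, G_next)
```

No policy derivative appears, which is why this form is defined even where the greedy policy is not differentiable.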

In the special case of $\lambda = 1$, $G'_t$ becomes identical to $\left(\frac{\partial V^\pi}{\partial x}\right)_t$ (to see this,
remember that for $\lambda = 1$, we have $V'(x,w) = V^\pi(x,w)$).

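All terms of Eq. 7 are obtainable from knowledge of the model functions
and the policy. For obtaining the term $\frac{\partial \pi}{\partial x}$ it is usually preferable to have
the greedy policy written in analytical form (e.g. the policy used by [?]).
Alternatively, using a derivation similar to that of Lemma 7, it can be shown
that, when it exists, $\left(\frac{\partial \pi}{\partial x}\right)_t$ is such that all saturated components satisfy
$\left(\frac{\partial \pi^i}{\partial x}\right)_t = 0$ and all unsaturated components satisfy:

$$\left(\frac{\partial u(\pi)}{\partial x}\right)_t = -\left(\frac{\partial^2 Q}{\partial x\, \partial u(a)}\right)_t \left(\frac{\partial^2 Q}{\partial u(a)\, \partial u(a)}\right)_t^{-1} \tag{9}$$

if $\left(\frac{\partial^2 Q}{\partial u(a)\, \partial u(a)}\right)_t^{-1}$ exists.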
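In the scalar, unsaturated case the relation above reduces to the negative ratio of two second derivatives of $Q$. The sketch below checks this numerically on a hypothetical quadratic example where the greedy action is available in closed form; the model, reward, value function, and the weight `W` are all assumptions for illustration.

```python
W = 1.5  # hypothetical weight of the approximate value function V(x, w) = -w x^2


def Q(x, a, w=W):
    """Q = r + V(f), with r = -0.1 a^2 and f = x + 0.5 a (both hypothetical)."""
    return -0.1 * a * a - w * (x + 0.5 * a) ** 2


def greedy_unsaturated(x, w=W):
    """Closed-form maximiser of Q over a, valid while |a| < 1."""
    return -w * x / (0.2 + 0.5 * w)


def second_derivs(x, a, h=1e-3):
    """Central finite differences for d2Q/dxda and d2Q/dada."""
    d2_xa = (Q(x + h, a + h) - Q(x + h, a - h)
             - Q(x - h, a + h) + Q(x - h, a - h)) / (4 * h * h)
    d2_aa = (Q(x, a + h) - 2 * Q(x, a) + Q(x, a - h)) / (h * h)
    return d2_xa, d2_aa


x = 0.3
a = greedy_unsaturated(x)
d2_xa, d2_aa = second_derivs(x, a)
dpi_dx_from_Q = -d2_xa / d2_aa      # scalar form of the relation for unsaturated actions
eps = 1e-6
dpi_dx_direct = (greedy_unsaturated(x + eps) - greedy_unsaturated(x - eps)) / (2 * eps)
```

Both routes give the same slope of the greedy policy, here $-w / (0.2 + 0.5w)$, which shows how the policy's dependence on the state is carried by second derivatives of $Q$.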

Existence conditions for the Target Value-Gradient. The target
value-gradient is a key concept of the VGL method, so we should check under
which conditions it exists.

If $\lambda = 0$ then $G'_t$ is defined to always exist by Eq. 8. If $\lambda > 0$ and if
$\left(\frac{\partial \pi}{\partial x}\right)_{t_0}$ does not exist at some time step, $t_0$, of the trajectory, then $G'_t$ is not
defined for all $t \le t_0$. Conditions in which $\frac{\partial \pi}{\partial x}$ might not exist are mentioned
in Lemma 1 and Eq. 9.

It may be that in some problems the environment and model functions
are such that $G'_t$ does not exist, even though the model functions are
designed to be differentiable. For example, if an agent is at the boundary
between a terminal state and a non-terminal state, and its velocity is zero,
then which way it goes next will determine whether the trajectory
terminates or not. Hence the total reward could be vastly different
in those two cases, and so the function $V^\pi$ is not differentiable with respect
to $x$ at that point. These bifurcation points are hopefully rare in state space
in most problems.

The above rare occurrences of the non-existence of $G'$ do not affect the
two main theoretical results of this paper (Sections 3 and 4), since both
results talk about consequences of when the target value gradient does exist.

However, certain problems would not be suitable for the VGL method
without their reformulation, for example if the total reward was defined to
be equal to the integer number of time-steps in a trajectory. This total
reward function is a step function, so does not give any useful derivative for
learning. Even though time needs to be discretized for simulation purposes, a
calculation of the actual continuous value of time would be needed to make a
useful differentiable total reward function. If the problem is modified so that
the model functions more accurately reflect the underlying continuous time
process then the VGL method will work. As a rule of thumb, if a problem is
suitable for PGL methods, then it will be suitable for VGL methods.

2.5. The VGL($\lambda$) Learning Algorithm

The objective for any VGL algorithm is to attain $G_t = G'_t$ for all $t \ge 0$
along a greedy trajectory. It is proven in Section 3 that this objective
is sufficient to ensure the trajectory is locally extremal, and often locally
optimal. As with the objective for learning the targets $V'$, it should be noted
that the VGL objective is not straightforward to achieve, since the targets $G'_t$
are moving ones and are highly dependent on $\vec{w}$. Hence we must use a weight
update to slowly move the approximated gradients towards their targets:



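$$\Delta\vec{w} = \alpha \sum_{t \ge 0} \left(\frac{\partial G}{\partial \vec{w}}\right)_t \Omega_t \left(G'_t - G_t\right) \tag{10}$$
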
This equation defines the VGL($\lambda$) algorithm. It is based on the GDHP
and DHP algorithms jo], [lij . Our definition of G' in Eq. [7] extends Werbos' 
methods, which were defined only for $\lambda = 0$, to work with any $\lambda$. Two
implementations of this algorithm are given in Appendix B.

The $\Omega_t$ matrix is any positive definite matrix (as introduced by [18]),
arbitrarily chosen by the experimenter. $\Omega_t$ is included for generality, since
the presence of any positive definite matrix here in the equation will force
every component of $G_t$ to move towards the corresponding component of $G'_t$
(in any basis). For simplicity $\Omega_t$ is often just taken to be the identity matrix
for all $t$ (as in Werbos' algorithm DHP). One use for making $\Omega_t$ arbitrary
could be for the experimenter to be able to compensate explicitly for any
rescalings of the state space axes.

It seems an inspired choice by Werbos to have included the matrix $\Omega_t$
at all, since in Section 4 it spontaneously appears in a PGL weight update,
giving us an explicit formula we can use to specify $\Omega_t$ when $\lambda = 1$. In this
case it suffices for $\Omega_t$ to be positive semi-definite.

Any VGL algorithm is going to involve using the matrices $\frac{\partial G}{\partial \vec{w}}$ and/or
$\frac{\partial G}{\partial \vec{x}}$ which, for neural networks, involves second-order back-propagation. This
is described in chapter 10 of [9].

3. The Local Optimality of the Value-Gradient Learning Objective 

In this section we define locally optimal trajectories and prove that if the
VGL objective is achieved, i.e. if $G'_t = G_t$ for all $t$ along a greedy trajectory,
then that trajectory is locally extremal and, in certain situations, locally
optimal.

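We first define the total reward for a given trajectory, irrespective
of the policy that was used to find it. For any trajectory starting at state $\vec{x}_0$
and following actions $(\vec{a}_0, \vec{a}_1, \ldots, \vec{a}_{F-1})$ until reaching a terminal state under
the given model, the total reward encountered is given by the function:

$$\begin{aligned}
R(\vec{x}_0, \vec{a}_0, \vec{a}_1, \ldots, \vec{a}_{F-1}) &= \sum_{t=0}^{F-1} r(\vec{x}_t, \vec{a}_t) \\
&= r(\vec{x}_0, \vec{a}_0) + R(f(\vec{x}_0, \vec{a}_0), \vec{a}_1, \vec{a}_2, \ldots, \vec{a}_{F-1})
\end{aligned} \tag{11}$$

with $R(\vec{x}_F) = 0$. Thus $R$ is a function of the arbitrary starting state $\vec{x}_0$
and the actions. We extend the trajectory shorthand notation to include the
function $R$, so that for any given trajectory, $R_t = R(\vec{x}_t, \vec{a}_t, \vec{a}_{t+1}, \ldots, \vec{a}_{F-1})$.
This enables us to define the partial derivatives as $\left(\frac{\partial R}{\partial \vec{x}}\right)_t = \frac{\partial R_t}{\partial \vec{x}_t}$ and $\left(\frac{\partial R}{\partial \vec{a}}\right)_t = \frac{\partial R_t}{\partial \vec{a}_t}$.
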
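Locally Optimal Trajectories. We define a trajectory parameterised by
values $(\vec{x}_0, \vec{a}_0, \vec{a}_1, \vec{a}_2, \ldots)$ to be locally optimal if $R(\vec{x}_0, \vec{a}_0, \vec{a}_1, \vec{a}_2, \ldots)$ is at a
local maximum with respect to the parameters $(\vec{a}_0, \vec{a}_1, \vec{a}_2, \ldots)$, subject to the
constraints (if present) that $(\vec{a}_t)^i \in [-1, 1]$ for each action component $i$.

Locally Extremal Trajectories (LET). We define a trajectory parameterised
by values $(\vec{x}_0, \vec{a}_0, \vec{a}_1, \vec{a}_2, \ldots)$ to be locally extremal if, for all $t$ and all
action components $i$,

$$\begin{cases}
\left(\frac{\partial R}{\partial \vec{a}}\right)^i_t = 0 & \text{if } (\vec{a}_t)^i \text{ is not saturated} \\
\left(\frac{\partial R}{\partial \vec{a}}\right)^i_t > 0 & \text{if } (\vec{a}_t)^i \text{ is saturated and } (\vec{a}_t)^i = 1 \\
\left(\frac{\partial R}{\partial \vec{a}}\right)^i_t < 0 & \text{if } (\vec{a}_t)^i \text{ is saturated and } (\vec{a}_t)^i = -1.
\end{cases} \tag{12}$$

In the case that all the actions are unbound, this criterion for a LET simplifies
to that of just requiring $\left(\frac{\partial R}{\partial \vec{a}}\right)_t = \vec{0}$ for all $t$.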

Having the possibility of bound actions introduces the extra complication
of saturated actions. The second condition in Eq. 12 can be understood by
the steering wheel analogy given before (in the definition of saturated action
components): if the action component $(\vec{a}_t)^i = 1$ is saturated then the steering
wheel is fully turned to the right with pressure, implying we would like to
turn the car even more in that direction if that were possible (even though
it isn't), which literally means that $\left(\frac{\partial R}{\partial \vec{a}}\right)^i_t > 0$. The third condition in Eq.
12 is simply the reverse of this.

In fact, in this definition R is locally optimal with respect to any sat- 
urated actions. Consequently, if all of the actions are fully saturated (a 
situation known as bang-bang control), then this definition of a LET provides 
a sufficient condition for a locally optimal trajectory. 

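Lemma 5. For a greedy trajectory and any fixed $\lambda \in [0, 1]$, if $G'_t = G_t$ for
all $t$ then $G'_t = G_t = \left(\frac{\partial R}{\partial \vec{x}}\right)_t$ for all $t$.

Proof. This is proven by induction. First we note that since $G_t$ is defined to
exist, $G'_t$ must also exist (since $G'_t = G_t$ for all $t$). Next, for $\lambda \in (0, 1]$,
substituting $G'_t = G_t$ into Eq. 7 gives

$$\begin{aligned}
G_t &= \left(\frac{\partial r}{\partial \vec{x}}\right)_t + \left(\frac{\partial \pi}{\partial \vec{x}}\right)_t \left(\frac{\partial r}{\partial \vec{a}}\right)_t + \left(\left(\frac{\partial f}{\partial \vec{x}}\right)_t + \left(\frac{\partial \pi}{\partial \vec{x}}\right)_t \left(\frac{\partial f}{\partial \vec{a}}\right)_t\right) G_{t+1} \\
&= \left(\frac{\partial r}{\partial \vec{x}}\right)_t + \left(\frac{\partial \pi}{\partial \vec{x}}\right)_t \left(\frac{\partial Q}{\partial \vec{a}}\right)_t + \left(\frac{\partial f}{\partial \vec{x}}\right)_t G_{t+1} \\
&= \left(\frac{\partial r}{\partial \vec{x}}\right)_t + \left(\frac{\partial f}{\partial \vec{x}}\right)_t G_{t+1} \qquad \text{by Lemma 1}
\end{aligned}$$

where in the application of Lemma 1 we used the fact that $\left(\frac{\partial \pi}{\partial \vec{x}}\right)_t$ exists since
$G'_t$ exists. For $\lambda = 0$, substituting $G'_t = G_t$ into Eq. 8 gives the same
result again, i.e.

$$G_t = \left(\frac{\partial r}{\partial \vec{x}}\right)_t + \left(\frac{\partial f}{\partial \vec{x}}\right)_t G_{t+1} \tag{13}$$

So Eq. 13 is valid for all $\lambda \in [0, 1]$.

Also, differentiating Eq. 11 with respect to $\vec{x}$ gives

$$\left(\frac{\partial R}{\partial \vec{x}}\right)_t = \left(\frac{\partial r}{\partial \vec{x}}\right)_t + \left(\frac{\partial f}{\partial \vec{x}}\right)_t \left(\frac{\partial R}{\partial \vec{x}}\right)_{t+1}$$

So $\left(\frac{\partial R}{\partial \vec{x}}\right)_t$ and $G_t$ have the same recursive definition. Also their values at the
final time step $t = F$ are the same, since $\left(\frac{\partial R}{\partial \vec{x}}\right)_F = G_F = \vec{0}$. Therefore, by
induction, $G'_t = G_t = \left(\frac{\partial R}{\partial \vec{x}}\right)_t$ for all $t$. ■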
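
The recursion used in this proof can be checked numerically. The following is a minimal sketch, assuming a hypothetical scalar problem with model function $f(x, a) = 0.9x + a$, reward $r(x, a) = -(x^2 + a^2)$ and a trajectory that terminates after a fixed list of actions; none of these choices come from the paper. It compares the backward recursion for $\left(\frac{\partial R}{\partial x}\right)_0$ with a finite-difference derivative of the total reward of Eq. 11, holding the actions fixed.

```python
def f(x, a):
    """Hypothetical model function (assumption for illustration)."""
    return 0.9 * x + a

def r(x, a):
    """Hypothetical reward function (assumption for illustration)."""
    return -(x * x + a * a)

def total_reward(x0, actions):
    """R(x0, a0, ..., a_{F-1}) as in Eq. 11, with the actions held fixed."""
    total, x = 0.0, x0
    for a in actions:
        total += r(x, a)
        x = f(x, a)
    return total

def dR_dx_by_recursion(x0, actions):
    """Backward recursion (dR/dx)_t = (dr/dx)_t + (df/dx)_t * (dR/dx)_{t+1}."""
    xs = [x0]
    for a in actions:
        xs.append(f(xs[-1], a))
    grad = 0.0  # (dR/dx)_F = 0 at the terminal state
    for t in reversed(range(len(actions))):
        dr_dx = -2.0 * xs[t]   # partial derivative of r with respect to x
        df_dx = 0.9            # partial derivative of f with respect to x
        grad = dr_dx + df_dx * grad
    return grad                # this is (dR/dx)_0

actions = [-0.3, -0.2, -0.1]
x0, eps = 1.0, 1e-5
fd = (total_reward(x0 + eps, actions) - total_reward(x0 - eps, actions)) / (2 * eps)
rec = dR_dx_by_recursion(x0, actions)
print(rec, fd)
```

Both numbers agree to within finite-difference error, which is the property that lets $G_t$ be identified with $\left(\frac{\partial R}{\partial x}\right)_t$ once the two share the same recursion and terminal value.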

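Theorem 1. Any greedy trajectory satisfying $G'_t = G_t$ (for all $t$) must be
locally extremal.

Proof. Since the greedy policy maximises $Q(\vec{x}_t, \vec{a}_t, \vec{w})$ with respect to $\vec{a}_t$ at
each time-step $t$, we know at each $t$ and for each action component $i$,

$$\begin{cases}
\left(\frac{\partial Q}{\partial \vec{a}}\right)^i_t = 0 & \text{if } (\vec{a}_t)^i \text{ is not saturated} \\
\left(\frac{\partial Q}{\partial \vec{a}}\right)^i_t > 0 & \text{if } (\vec{a}_t)^i \text{ is saturated and } (\vec{a}_t)^i = 1 \\
\left(\frac{\partial Q}{\partial \vec{a}}\right)^i_t < 0 & \text{if } (\vec{a}_t)^i \text{ is saturated and } (\vec{a}_t)^i = -1.
\end{cases} \tag{14}$$

These follow from Lemmas 3 and 2. Therefore since,

$$\begin{aligned}
\left(\frac{\partial R}{\partial \vec{a}}\right)_t &= \left(\frac{\partial r}{\partial \vec{a}}\right)_t + \left(\frac{\partial f}{\partial \vec{a}}\right)_t \left(\frac{\partial R}{\partial \vec{x}}\right)_{t+1} && \text{by differentiating Eq. 11} \\
&= \left(\frac{\partial r}{\partial \vec{a}}\right)_t + \left(\frac{\partial f}{\partial \vec{a}}\right)_t G_{t+1} && \text{by Lemma 5} \\
&= \left(\frac{\partial Q}{\partial \vec{a}}\right)_t && \text{by Eq. 4}
\end{aligned}$$

we have $\left(\frac{\partial R}{\partial \vec{a}}\right)_t = \left(\frac{\partial Q}{\partial \vec{a}}\right)_t$ for all $t$. Therefore the consequences of the greedy
policy (Eq. 14) become equivalent to the sufficient conditions for a LET (Eq.
12), which implies the trajectory is a LET. ■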



Corollary 1. If, in addition to the conditions of Theorem 1, all of the actions
are saturated (bang-bang control), then the trajectory is locally optimal.

This follows from the definition of a LET given above. ■

Remark 1. According to the Bang-Bang Principle [19], bang-bang control
often arises in situations where the model functions are linear with respect
to bound action vectors, or when the problem being solved is a time
minimisation problem. Hence it is often the case that all LETs found by this
method are locally optimal.

Remark 2. If the VGL objective (i.e. $G'_t = G_t$ for all $t$) is achieved as the
fixed point of a weight update that is gradient ascent on the total reward
(such as the weight update of Eq. 16 in Section 4), then the LET must
be locally optimal, because the objective was arrived at by gradient ascent
on the total reward.

Since the weight update of Eq. 16 is a special case of the VGL($\lambda$) weight
update (Eq. 10), it is speculated, but still an open question, that whenever
the VGL objective is met by use of any instance of Eq. 10 (i.e. while using
any $\Omega_t$ matrix, or any $\lambda$), it could always produce locally optimal trajectories.

Remark 3. We point out that the proof of Theorem 1 is closely related to
Pontryagin's maximum principle (PMP), since Eq. 13 satisfies the PMP
equation for a "costate" vector. Therefore $G_t$ is the costate vector of PMP,
and the greedy policy (almost) forms the maximum condition of PMP (the
only difference being that PMP is defined for continuous time systems). This
completes Pontryagin's conditions for a LET. However PMP still needs
supplementing with Lemma 5 for it to be applicable for any $\lambda \in [0, 1]$. We
did not use PMP explicitly because it only identifies LETs, and the way we
have formulated the proof allows us to derive the corollary's extra conclusion
that bang-bang control produces locally optimal trajectories.

3.1. Discussion 

This local optimality proof is valid for any $\lambda \in [0, 1]$. This optimality
condition only needs satisfying over a single trajectory, whereas for VL the
corresponding optimality condition ($V' = V$) needs satisfying over the whole
of state space. This implies that VGL methods have a much lesser requirement
for exploration than VL methods do, since the local part of exploration
comes for free with VGL methods. What we mean by this is that provided
the VGL learning algorithm makes progress towards achieving $G'_t = G_t$ all
along a greedy trajectory, the trajectory will make progress in bending itself
towards a locally optimal shape, and this will happen without the need for
any stochastic exploration. We argue that this leads to greater efficiency for
VGL compared to VL, and experiments confirm this in some simple problem
domains, by several orders of magnitude in most cases.

In comparison to VGL, the failure of VL without any exploration in a 
deterministic environment is dramatic and common, even when the value- 
function is perfectly learned along a single trajectory; examples are given in 
section L3 of [8]. 

A separate efficiency issue is the algorithmic complexity of VL and VGL.
These are the same ($O(\dim(\vec{w}))$ per time step) if Algorithm 1 is
used, but VGL is slower ($O(\dim(\vec{w})(\dim(\vec{x}))^2)$ per time step) if the on-line
implementation of Algorithm 2 is used.

The requirement of our optimality proof for episodic problems could be
relaxed by introducing a discount factor $\gamma \in [0, 1]$ (see [2] for details), and the
proof can then be extended by altering the terminal step of the induction of
Lemma 5 to instead use the boundary condition $\lim_{t \to \infty} \gamma^t G'_t = \vec{0}$. However
it is not yet clear how to extend the proof to an undiscounted non-episodic
problem.

The stochastic case for $\lambda = 0$ is considered by [18].



4. The Relationship of VGL to Policy-Gradient Learning 

We now prove that the VGL($\lambda$) weight update of Eq. 10, with $\lambda = 1$ and
a carefully chosen $\Omega_t$ matrix, is equivalent to PGL on a greedy policy.

PGL, sometimes also known as "direct" reinforcement learning, is defined
to be gradient ascent on $V^\pi(\vec{x}_0, \vec{z})$ with respect to the weight vector $\vec{z}$ of
a general policy, i.e. $\Delta\vec{z} = \alpha \left(\frac{\partial V^\pi}{\partial \vec{z}}\right)_0$ for some small positive constant $\alpha$.
PGL methods will naturally find local maxima of $V^\pi(\vec{x}_0, \vec{z})$ and have good
convergence properties.

A very direct and efficient method to calculate the policy gradient, $\left(\frac{\partial V^\pi}{\partial \vec{z}}\right)_0$,
when the model functions are known is backpropagation through time (BPTT)
[12], which we will follow here, and it is well suited to deterministic systems.
However most studies of PGL in the RL literature [11], [13] are designed for
stochastic environments and unknown model functions, and they form the
mean of the policy gradient after sampling many trajectories. [IJ] describes a rapid model
learning method that finds the policy gradient after just one trajectory.
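The required PGL gradient can be expanded as follows:

$$\begin{aligned}
\left(\frac{\partial V^\pi}{\partial \vec{z}}\right)_t &= \left(\frac{\partial}{\partial \vec{z}}\Big[ r(\vec{x}, \pi(\vec{x}, \vec{z})) + V^\pi(f(\vec{x}, \pi(\vec{x}, \vec{z})), \vec{z}) \Big]\right)_t && \text{(by the definition of } V^\pi\text{)} \\
&= \left(\frac{\partial \pi}{\partial \vec{z}}\right)_t \left( \left(\frac{\partial r}{\partial \vec{a}}\right)_t + \left(\frac{\partial f}{\partial \vec{a}}\right)_t \left(\frac{\partial V^\pi}{\partial \vec{x}}\right)_{t+1} \right) + \left(\frac{\partial V^\pi}{\partial \vec{z}}\right)_{t+1}
\end{aligned}$$

Expanding this recursion and substituting it into the gradient ascent
equation gives,

$$\Delta\vec{z} = \alpha \sum_{t \ge 0} \left(\frac{\partial \pi}{\partial \vec{z}}\right)_t \left( \left(\frac{\partial r}{\partial \vec{a}}\right)_t + \left(\frac{\partial f}{\partial \vec{a}}\right)_t \left(\frac{\partial V^\pi}{\partial \vec{x}}\right)_{t+1} \right) \tag{15}$$
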

BPTT is merely an efficient implementation of this formula, often for
cases where the policy $\pi(\vec{x}, \vec{z})$ is provided by a neural network [12], but also
defined for a general differentiable policy. We note that although this equation
looks quite different from the more regularly used PGL equation of [11],
the two are mathematically equivalent since it is proven in [11] that their
weight update is equal to $\alpha \left(\frac{\partial V^\pi}{\partial \vec{z}}\right)_0$.
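
As an illustration of this formula, here is a minimal BPTT sketch in Python. It assumes a hypothetical scalar problem (model $f(x, a) = x + a$, reward $r(x, a) = -(x^2 + a^2)$, linear policy $\pi(x, z) = zx$, and termination after a fixed number of steps `horizon`); these choices and all function names are assumptions for illustration only. The backward pass accumulates the summand of the policy-gradient formula while propagating $\left(\frac{\partial V^\pi}{\partial x}\right)_{t+1}$, and the result can be compared against a finite-difference derivative of the total reward.

```python
def f(x, a):
    """Hypothetical model function."""
    return x + a

def r(x, a):
    """Hypothetical reward function."""
    return -(x * x + a * a)

def policy(x, z):
    """Hypothetical linear policy pi(x, z) = z * x."""
    return z * x

def rollout(x0, z, horizon):
    """Unroll the trajectory for a fixed number of steps."""
    xs, acts = [x0], []
    for _ in range(horizon):
        a = policy(xs[-1], z)
        acts.append(a)
        xs.append(f(xs[-1], a))
    return xs, acts

def total_reward(x0, z, horizon):
    xs, acts = rollout(x0, z, horizon)
    return sum(r(x, a) for x, a in zip(xs, acts))

def bptt_policy_gradient(x0, z, horizon):
    """Sum over t of (dpi/dz)_t * ((dr/da)_t + (df/da)_t * (dV/dx)_{t+1})."""
    xs, acts = rollout(x0, z, horizon)
    grad = 0.0
    dV_dx_next = 0.0  # (dV/dx) is zero at the terminal state
    for t in reversed(range(horizon)):
        x, a = xs[t], acts[t]
        dr_dx, dr_da = -2.0 * x, -2.0 * a
        df_dx, df_da = 1.0, 1.0
        dpi_dx, dpi_dz = z, x
        grad += dpi_dz * (dr_da + df_da * dV_dx_next)
        # Total derivative of V with respect to x_t, including the policy's dependence on x.
        dV_dx_next = (dr_dx + dpi_dx * dr_da) + (df_dx + dpi_dx * df_da) * dV_dx_next
    return grad

gradient = bptt_policy_gradient(1.0, -0.5, 5)
print(gradient)
```

A gradient ascent step would then be `z += alpha * gradient` for some small positive `alpha`.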

The above weight update was derived for a general policy. We now switch
to specifically consider PGL applied to a greedy policy, so that all instances
of the weight vector $\vec{z}$ will now be replaced by $\vec{w}$.

The equivalence we will show only holds when the actions are unbound.
If bound actions are required then they could be implemented indirectly
by applying an exponential cost function to the model function $r(\vec{x}, \vec{a})$ to
penalise components of $\vec{a}$ that get close to their desired limits, prohibiting
the greedy policy from choosing actions beyond this range.

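Initially we only consider the case where $\left(\frac{\partial V^\pi}{\partial \vec{w}}\right)_0$ exists for a greedy
trajectory, and hence where $\left(\frac{\partial \pi}{\partial \vec{w}}\right)_t$ exists for all $t$. The terms $\left(\frac{\partial r}{\partial \vec{a}}\right)_t$ and
$\left(\frac{\partial \pi}{\partial \vec{w}}\right)_t$ are now reinterpreted under the greedy policy:

Lemma 6. The greedy policy implies, for an unbound action, $\left(\frac{\partial r}{\partial \vec{a}}\right)_t = -\left(\frac{\partial f}{\partial \vec{a}}\right)_t G_{t+1}$.

Proof. Substitute $\left(\frac{\partial Q}{\partial \vec{a}}\right)_t = \vec{0}$ (Lemma 3 for an unbound action) into Eq. 4. ■

Lemma 7. When $\left(\frac{\partial \pi}{\partial \vec{w}}\right)_t$ and $\left(\frac{\partial^2 Q}{\partial \vec{a} \partial \vec{a}}\right)^{-1}_t$ exist for an unbound action $\vec{a}_t$, the
greedy policy implies

$$\left(\frac{\partial \pi}{\partial \vec{w}}\right)_t = -\left(\frac{\partial G}{\partial \vec{w}}\right)_{t+1} \left(\frac{\partial f}{\partial \vec{a}}\right)^T_t \left(\frac{\partial^2 Q}{\partial \vec{a} \partial \vec{a}}\right)^{-1}_t$$

Proof. We use implicit differentiation. The dependency of $\vec{a}_t = \pi(\vec{x}_t, \vec{w})$
on $\vec{w}$ must be such that Lemma 3 is always satisfied, i.e. so that $\left(\frac{\partial Q}{\partial \vec{a}}\right)_t = \vec{0}$,
both before and after any infinitesimal change to $\vec{w}$. Therefore the function
$\pi(\vec{x}_t, \vec{w})$ must be such that,

$$\begin{aligned}
0 &= \frac{\partial}{\partial \vec{w}} \left( \frac{\partial Q(\vec{x}_t, \pi(\vec{x}_t, \vec{w}), \vec{w})}{\partial \vec{a}_t} \right) \\
&= \frac{\partial}{\partial \vec{w}} \left( \frac{\partial Q(\vec{x}_t, \vec{a}_t, \vec{w})}{\partial \vec{a}_t} \right) + \left(\frac{\partial \pi}{\partial \vec{w}}\right)_t \frac{\partial}{\partial \vec{a}_t} \left( \frac{\partial Q(\vec{x}_t, \vec{a}_t, \vec{w})}{\partial \vec{a}_t} \right) && \text{(chain rule)} \\
&= \frac{\partial}{\partial \vec{w}} \left( \left(\frac{\partial r}{\partial \vec{a}}\right)_t + \left(\frac{\partial f}{\partial \vec{a}}\right)_t G_{t+1} \right) + \left(\frac{\partial \pi}{\partial \vec{w}}\right)_t \left(\frac{\partial^2 Q}{\partial \vec{a} \partial \vec{a}}\right)_t && \text{(by Eq. 4)} \\
&= \sum_i \left(\frac{\partial (G)^i}{\partial \vec{w}}\right)_{t+1} \left(\frac{\partial (f)^i}{\partial \vec{a}}\right)_t + \left(\frac{\partial \pi}{\partial \vec{w}}\right)_t \left(\frac{\partial^2 Q}{\partial \vec{a} \partial \vec{a}}\right)_t \\
&= \left(\frac{\partial G}{\partial \vec{w}}\right)_{t+1} \left(\frac{\partial f}{\partial \vec{a}}\right)^T_t + \left(\frac{\partial \pi}{\partial \vec{w}}\right)_t \left(\frac{\partial^2 Q}{\partial \vec{a} \partial \vec{a}}\right)_t && \text{(inner product)}
\end{aligned}$$

The penultimate line used the fact that $\frac{\partial r}{\partial \vec{a}}$ and $\frac{\partial f}{\partial \vec{a}}$ are not functions of $\vec{w}$.

Then solving the final line for $\left(\frac{\partial \pi}{\partial \vec{w}}\right)_t$ proves the lemma. ■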

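Substituting Lemmas 6 and 7, and $\left(\frac{\partial V^\pi}{\partial \vec{x}}\right)_t \equiv G'_t$ with $\lambda = 1$ (see
Eq. 7), into Eq. 15 gives:

$$\begin{aligned}
\Delta\vec{w} &= \alpha \sum_{t \ge 0} -\left(\frac{\partial G}{\partial \vec{w}}\right)_{t+1} \left(\frac{\partial f}{\partial \vec{a}}\right)^T_t \left(\frac{\partial^2 Q}{\partial \vec{a} \partial \vec{a}}\right)^{-1}_t \left(\frac{\partial f}{\partial \vec{a}}\right)_t \left(G'_{t+1} - G_{t+1}\right) \\
&= \alpha \sum_{t \ge 0} \left(\frac{\partial G}{\partial \vec{w}}\right)_{t+1} \Omega_t \left(G'_{t+1} - G_{t+1}\right)
\end{aligned} \tag{16}$$

where

$$\Omega_t = -\left(\frac{\partial f}{\partial \vec{a}}\right)^T_t \left(\frac{\partial^2 Q}{\partial \vec{a} \partial \vec{a}}\right)^{-1}_t \left(\frac{\partial f}{\partial \vec{a}}\right)_t \tag{17}$$

and is positive semi-definite, by the greedy policy (Lemma 3).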
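
The equivalence can be checked numerically on a small example. The sketch below assumes a hypothetical scalar problem that is not taken from the paper: model $f(x, a) = x + a$, reward $r(x, a) = -(x^2 + a^2)$, approximate value function $V(x, w) = \frac{1}{2} w x^2$ (so $G = wx$), and termination after a fixed number of steps with $V = 0$ at the terminal state. For $w < 2$ the greedy action is $a = wx/(2 - w)$, and $\frac{\partial^2 Q}{\partial a \partial a} = w - 2$, so the $\Omega_t$ formula above gives $1/(2 - w)$. The function `vgl1_update` evaluates the VGL(1) sum without the factor $\alpha$, which should match the derivative of the total reward with respect to $w$.

```python
def greedy_gain(w):
    """a = k * x maximises Q(x, a, w) = -(x^2 + a^2) + 0.5 * w * (x + a)^2 when w < 2."""
    return w / (2.0 - w)

def rollout(x0, w, horizon):
    """Greedy trajectory; the last action is 0 because V is zero at the terminal state."""
    k = greedy_gain(w)
    xs, acts = [x0], []
    for t in range(horizon):
        a = k * xs[-1] if t < horizon - 1 else 0.0
        acts.append(a)
        xs.append(xs[-1] + a)
    return xs, acts

def total_reward(x0, w, horizon):
    xs, acts = rollout(x0, w, horizon)
    return -sum(x * x + a * a for x, a in zip(xs, acts))

def vgl1_update(x0, w, horizon):
    """Sum over t of (dG/dw)_{t+1} * Omega_t * (G'_{t+1} - G_{t+1}), with lambda = 1."""
    xs, acts = rollout(x0, w, horizon)
    k = greedy_gain(w)
    omega = 1.0 / (2.0 - w)  # -(df/da)^2 / (d2Q/da2) with df/da = 1, d2Q/da2 = w - 2
    update = 0.0
    target_next = 0.0        # G'_F = 0 at the terminal state
    for t in reversed(range(horizon)):
        if t + 1 < horizon:  # no learning at the terminal state
            G_next = w * xs[t + 1]
            dG_dw_next = xs[t + 1]
            update += dG_dw_next * omega * (target_next - G_next)
        dpi_dx = k if t < horizon - 1 else 0.0
        # G'_t = (Dr/Dx)_t + (Df/Dx)_t * G'_{t+1} for lambda = 1
        target_next = (-2.0 * xs[t] + dpi_dx * (-2.0 * acts[t])) + (1.0 + dpi_dx) * target_next
    return update

update = vgl1_update(1.0, 0.5, 4)
print(update)
```

In this toy problem the VGL(1) sum and the finite-difference policy gradient agree, which is what the derivation predicts when $\Omega_t$ is chosen as above.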

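Equation 16 is identical to a VGL weight update equation (Eq. 10), with
a carefully chosen matrix for $\Omega_t$, and $\lambda = 1$, provided $\left(\frac{\partial \pi}{\partial \vec{w}}\right)_t$ and $\left(\frac{\partial^2 Q}{\partial \vec{a} \partial \vec{a}}\right)^{-1}_t$
exist for all $t$. If $\left(\frac{\partial \pi}{\partial \vec{w}}\right)_t$ does not exist, then $\left(\frac{\partial V^\pi}{\partial \vec{w}}\right)_0$ is not defined either.

This completes the demonstration of the equivalence of a value-function
learning algorithm (VGL(1), with the conditions stated above) to PGL (with
greedy policy; when $\left(\frac{\partial V^\pi}{\partial \vec{w}}\right)_0$ exists).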



4.1. Discussion

If the RL problem is such that $\left(\frac{\partial \pi}{\partial \vec{w}}\right)_t$ always exists, then the good convergence
guarantees of PGL will apply to VGL(1), under the above conditions.
For certain simple problems this is always true, but significantly it is always
true in the continuous time setting for VGL [8], where the value-gradient
policy defined by [8] is used. Conveniently, this policy also ensures the actions
are always completely unsaturated, which was a condition for the PGL
equivalence.

This equivalence result was surprising to the authors because it was
thought that the VL and VGL weight updates (Equations 6 and 10) were
based on gradient descent on the error functions $\sum_t (V'_t - V_t)^2$ and
$\sum_t (G'_t - G_t)^T \Omega_t (G'_t - G_t)$, respectively. In fact the weight update of Eq. 6 is
sometimes generally described as a gradient descent weight update of that error
function (e.g. see chapter 8.2 of [2]). But neither weight update is true gradient
descent, unless both $\lambda = 1$ and the policy is fixed, i.e. non-greedy.
For example, Equations 6 and 10 have far fewer terms in them than would
be found by differentiating the two error functions fully using the chain rule
(e.g. [20] describes the missing terms for a fixed policy with VL and $\lambda = 0$;
even more terms are missing if a tightly-coupled greedy policy is used, even
when $\lambda = 1$). Therefore learning progress as measured by either error function
is far from monotonic. So this PGL-VGL equivalence proof is surprising
because it shows these anomalies for VGL(1) arise because the weight update
is actually gradient ascent on $V^\pi$ when $\Omega_t$ is chosen carefully, and it neatly
solves the monotonic progress problem for a tightly-coupled greedy policy.

It was also surprising to learn that PGL and value-function weight updates
are not as fundamentally different to each other as we first thought.
It was not known that any PGL weight update, when applied to a greedy
policy on an approximate value function, would be doing the same thing as
any value function weight update, even if both had $\lambda = 1$. Of course they are
usually not the same, unless this particular choice of $\Omega_t$ is made. The equivalence
now creates a difficulty in distinguishing between PGL and VGL(1)
with this particular $\Omega_t$. However, we describe the above weight update as a
VGL update; it is of the same form as Eq. 10, which was defined by [18], i.e.
prior to this equivalence being realised.

If $\lambda = 1$ is used then Eq. 17 is a good choice for $\Omega_t$, since it will ensure
monotonic progress with respect to $V^\pi$, under the above conditions. However
Eq. 17 means $\Omega_t$ can sometimes be very small, which could slow learning
down. Alternative choices for $\Omega_t$ (such as the identity matrix) may hence
produce an aggressive speed up of learning, but will do so at the expense of
monotonic progress. A less aggressive speed up method for the gradient ascent,
such as conjugate gradients, could be used instead of using the identity
matrix for $\Omega_t$.

This analysis has been successful for a tightly-coupled greedy policy and 
value function, which is unusual since most RL analyses of value-function 
updates in the literature so far have only been applicable for a "fixed" policy. 
Interestingly, using a tightly-coupled value-function and greedy policy was 
necessary for the equivalence to hold. 



[16] provides a related convergence result that also applies to the problem
of concurrently updating $V$ and $\pi$. Their result applies to an actor-critic
architecture, and since this does not use a greedy policy, they avoid the need
for considering $\frac{\partial \pi}{\partial \vec{w}}$ through Lemma 3. While our result compared to theirs
is disadvantaged in that it is only valid for $\lambda = 1$, some possible advantages
of our method over theirs are as follows. Their result is thought to apply
only when the function approximator for $V$ is linear in the same features of
the state vector that the function approximator for the policy uses as input
(see footnote 1 of [16]). Also the policy iteration algorithm that updates
both function approximators requires that the function approximator for $V$
is trained to completion, over all of state space, every time the policy changes,
and this requirement is nested inside the loop that updates the policy, so
the whole procedure is quite computationally expensive. Finally, in order for the
training process for $V$ to have guaranteed convergence, it would have to be
linear in $\vec{w}$ if $\lambda < 1$.



5. Conclusions and Further Work 

We have proven the local optimality of learning the target value gradients
along a greedy trajectory (for any $\lambda$), argued the efficiency benefits of that
approach, and demonstrated the equivalence of VGL(1) to PGL. In this
research we have been interested in genuine theoretical challenges to understanding
value-function learning with a greedy policy, regardless of whether
by VL or VGL; particularly how the greedy policy is affected by a
weight update to $V(\vec{x}, \vec{w})$ (as derived in Lemma 7), and
what exactly is required for an optimal trajectory (Section 3).

In further work we would like to extend the optimality proof of Section
3 to undiscounted non-episodic problems, and of course somehow work out how
to extend the convergence proof for $\lambda = 1$ to include $\lambda < 1$, which is unlikely
with the weight update in its current form since divergence examples for this
exist (e.g. section 4.3 of [8]).

Appendix A. Equivalence of $V'$ notation to the $\lambda$-Return

Although it was first discovered by Watkins [17], we use the definition
of the $\lambda$-Return given by [2]: $R^\lambda_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R^{(n)}_t$ with $R^{(n)}_t = \sum_{k=0}^{n-1} r_{t+k} + V_{t+n}$.
We aim to show that $V'_t$ is identical to $R^\lambda_t$. Expanding
the definition of $R^\lambda_t$ gives

$$\begin{aligned}
R^\lambda_t &= (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \left( \sum_{k=0}^{n-1} r_{t+k} + V_{t+n} \right) \\
&= (1 - \lambda) \left( r_t + \lambda (r_t + r_{t+1}) + \lambda^2 (r_t + r_{t+1} + r_{t+2}) + \ldots \right) + (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} V_{t+n} \\
&= (1 - \lambda) \sum_{n=0}^{\infty} r_{t+n} \left( \sum_{k=n}^{\infty} \lambda^k \right) + (1 - \lambda) \sum_{n=0}^{\infty} \lambda^n V_{t+n+1} \\
&= \sum_{n=0}^{\infty} \lambda^n r_{t+n} + (1 - \lambda) \sum_{n=0}^{\infty} \lambda^n V_{t+n+1}
\end{aligned}$$

Expanding the definition of $V'$ (Eq. 5) gives

$$\begin{aligned}
V'_t &= r_t + \left( \lambda V'_{t+1} + (1 - \lambda) V_{t+1} \right) \\
&= \sum_{n=0}^{\infty} \lambda^n \left( r_{t+n} + (1 - \lambda) V_{t+n+1} \right)
\end{aligned}$$

Thus $V'$ is identical to $R^\lambda$. However, since it uses a recursive notation,
$V'$ is easier to differentiate than $R^\lambda$, enabling us to define $G'$. The $\lambda$-Return
provides an equivalent formulation for the algorithm TD($\lambda$) known as the
"forwards view of TD($\lambda$)" [2]. This proves that Eq. 6 is equivalent to the
TD($\lambda$) weight update.
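
This identity can be checked numerically for a finite episode. The following minimal Python sketch assumes an undiscounted episode of `F` steps with rewards `rewards[0..F-1]` and value estimates `values[0..F]`, where `values[F] = 0` at the terminal state; the function names and the sample numbers are illustrative assumptions. For a finite episode, every $n$-step return with $n \ge F - t$ equals the full return, so those terms are collected with total weight $\lambda^{F-t-1}$.

```python
import math

def recursive_targets(rewards, values, lam):
    """V'_t = r_t + lam * V'_{t+1} + (1 - lam) * V_{t+1}, with V'_F = 0."""
    F = len(rewards)
    targets = [0.0] * (F + 1)
    for t in reversed(range(F)):
        targets[t] = rewards[t] + lam * targets[t + 1] + (1 - lam) * values[t + 1]
    return targets[:F]

def lambda_return(rewards, values, lam, t):
    """Forward-view lambda-return at time t for an episode that ends at step F."""
    F = len(rewards)
    total, partial = 0.0, 0.0
    for n in range(1, F - t):          # n-step returns that stop before the terminal state
        partial += rewards[t + n - 1]
        total += (1 - lam) * lam ** (n - 1) * (partial + values[t + n])
    full_return = sum(rewards[t:])     # every n >= F - t yields the full return
    return total + lam ** (F - t - 1) * full_return

rewards = [1.0, -2.0, 0.5, 3.0]
values = [0.3, -0.1, 0.7, 0.2, 0.0]    # last entry is the terminal state
lam = 0.6
recursive = recursive_targets(rewards, values, lam)
forward = [lambda_return(rewards, values, lam, t) for t in range(len(rewards))]
print(recursive)
print(forward)
```

The two lists agree up to floating-point rounding, including at the extremes `lam = 0` and `lam = 1`.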

Appendix B. A batch mode implementation, and an on-line implementation of the VGL($\lambda$) algorithm

We give two implementations of the VGL($\lambda$) algorithm which produce an
identical weight update. Algorithm 1 is the faster of the two, but requires
storage of a whole trajectory and is batch-mode only. Algorithm 2 can be
used on-line and is more memory efficient since it does not store a whole
trajectory, but is slower since it requires the manipulation of an "eligibility
trace" matrix.

Here we use the shorthand notation $\frac{D}{D\vec{x}} \equiv \frac{\partial}{\partial \vec{x}} + \frac{\partial \pi}{\partial \vec{x}} \frac{\partial}{\partial \vec{a}}$. We let $n_w$ and $n_x$
be the dimensions of $\vec{w}$ and $\vec{x}$ respectively. $\gamma$ is a constant discount factor,
where $\gamma \in [0, 1]$.

Algorithm 1 makes a direct implementation of Eq. 10 by making a forward
pass through the trajectory, storing all states and actions, followed by a backward
pass through the trajectory accumulating $G'_t$ by the recursion in Eq.
7. In this implementation, the second order derivatives of the approximate
value function (i.e. $\frac{\partial G}{\partial \vec{w}}$ and $\frac{\partial G}{\partial \vec{x}}$) are only required as an inner product with a
vector. This means that if a neural network is used for the function approximator,
and if we use methods analogous to those used by j^H or ch. 10 of
[9], then these matrix-vector products can be evaluated in $O(n_w)$ operations.
This means it takes $O(F n_w)$ operations for the algorithm to run on a whole
trajectory.



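Algorithm 1 VGL($\lambda$). Batch-mode implementation.

    $t \leftarrow 0$, $\Delta\vec{w} \leftarrow \vec{0}$
    {Unroll trajectory...}
    while not terminated($\vec{x}_t$) do
        $\vec{a}_t \leftarrow \pi(\vec{x}_t, \vec{w})$
        $\vec{x}_{t+1} \leftarrow f(\vec{x}_t, \vec{a}_t)$
        $t \leftarrow t + 1$
    end while
    $F \leftarrow t$
    $\vec{p} \leftarrow \vec{0}$
    {Backwards pass...}
    for $t = F - 1$ to $0$ step $-1$ do
        $G'_t \leftarrow \left(\frac{Dr}{D\vec{x}}\right)_t + \gamma \left(\frac{Df}{D\vec{x}}\right)_t \vec{p}$
        $\Delta\vec{w} \leftarrow \Delta\vec{w} + \left(\frac{\partial G}{\partial \vec{w}}\right)_t \Omega_t \left(G'_t - G_t\right)$
        $\vec{p} \leftarrow \lambda G'_t + (1 - \lambda) G_t$
    end for
    $\vec{w} \leftarrow \vec{w} + \alpha \Delta\vec{w}$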



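Algorithm 2 accumulates all the weight updates in a single forward pass
of the trajectory. It requires no storage of the trajectory, so is more memory
efficient, but requires more time to carry out matrix multiplications. This
algorithm requires the full $\frac{\partial G}{\partial \vec{w}}$ matrix which, for a neural network, would take
$O(n_x n_w)$ operations to evaluate. Hence the slowest steps in the algorithm
would be the matrix-matrix multiplications of lines 10 and 12, each taking
$O((n_x)^2 n_w)$ operations. Hence the total time for the algorithm to process a
full trajectory is $O(F (n_x)^2 n_w)$ operations.

To derive this algorithm, we first had to rewrite Eq. 7 as follows:

$$G'_t - G_t = \delta_t + \lambda\gamma \left(\frac{Df}{D\vec{x}}\right)_t \left(G'_{t+1} - G_{t+1}\right) \tag{B.1}$$

where we define

$$\delta_t = \left(\frac{Dr}{D\vec{x}}\right)_t + \gamma \left(\frac{Df}{D\vec{x}}\right)_t G_{t+1} - G_t \tag{B.2}$$

Unrolling the recursion in $(G'_t - G_t)$ of Eq. B.1 gives

$$G'_t - G_t = \delta_t + \lambda\gamma \left(\frac{Df}{D\vec{x}}\right)_t \delta_{t+1} + (\lambda\gamma)^2 \left(\frac{Df}{D\vec{x}}\right)_t \left(\frac{Df}{D\vec{x}}\right)_{t+1} \delta_{t+2} + \ldots$$

Then substituting this into the VGL($\lambda$) weight update equation (Eq. 10)
and reordering the terms gives:

$$\Delta\vec{w} = \alpha \sum_{t=0}^{F-1} E_t \delta_t \tag{B.3}$$

where $E_t$ is a matrix defined to be

$$E_t = \sum_{k=0}^{t} (\lambda\gamma)^{t-k} \left(\frac{\partial G}{\partial \vec{w}}\right)_k \Omega_k \left(\frac{Df}{D\vec{x}}\right)_k \left(\frac{Df}{D\vec{x}}\right)_{k+1} \cdots \left(\frac{Df}{D\vec{x}}\right)_{t-1}$$

We see that $E_t$ can be defined more simply by a recursion:

$$E_t = \lambda\gamma E_{t-1} \left(\frac{Df}{D\vec{x}}\right)_{t-1} + \left(\frac{\partial G}{\partial \vec{w}}\right)_t \Omega_t \tag{B.4}$$

with $E_{-1} = 0$. We call the matrix $E_t$ an "eligibility trace" matrix because it
acts similarly to the eligibility trace described for TD($\lambda$). Algorithm 2 is
then easily derived from Equations B.2, B.3 and B.4.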



Algorithm 2 VGL($\lambda$). On-line implementation.



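    $E \leftarrow 0$ {$E \in \mathbb{R}^{n_w \times n_x}$ is an "eligibility trace" workspace matrix.}
    $t \leftarrow 0$, $\Delta\vec{w} \leftarrow \vec{0}$
    while not terminated($\vec{x}_t$) do
        $\vec{a}_t \leftarrow \pi(\vec{x}_t, \vec{w})$
        $\vec{x}_{t+1} \leftarrow f(\vec{x}_t, \vec{a}_t)$
        if not terminated($\vec{x}_{t+1}$) then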



10: 

11 
12 
13 
14 
15 



end if 

E ^ E + 



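        $\Delta\vec{w} \leftarrow \Delta\vec{w} + E\delta$
        $t \leftarrow t + 1$
    end while
    $\vec{w} \leftarrow \vec{w} + \alpha \Delta\vec{w}$, $\Delta\vec{w} \leftarrow \vec{0}$ {This line can be moved up one position if true on-line weight updating is required.}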
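
The claim that the two implementations produce an identical weight update can be illustrated with a minimal scalar sketch in Python. It is not a transcription of either listing: it assumes that the per-step quantities $\left(\frac{Dr}{Dx}\right)_t$, $\left(\frac{Df}{Dx}\right)_t$, $G_t$, $\left(\frac{\partial G}{\partial w}\right)_t$ and $\Omega_t$ have already been computed and are passed in as plain lists (`dr`, `df`, `G`, `dG_dw`, `omega`, all hypothetical names with made-up sample values), that $G = 0$ at the terminal state, and that the factor $\alpha$ is applied afterwards. The batch function follows the backward recursion for $G'_t$; the on-line function follows the eligibility-trace form of Equations B.2 to B.4.

```python
import math

def batch_update(dr, df, G, dG_dw, omega, lam, gamma):
    """Backward pass: accumulate (dG/dw)_t * Omega_t * (G'_t - G_t)."""
    delta_w, p = 0.0, 0.0            # p holds lam * G'_{t+1} + (1 - lam) * G_{t+1}
    for t in reversed(range(len(dr))):
        target = dr[t] + gamma * df[t] * p
        delta_w += dG_dw[t] * omega[t] * (target - G[t])
        p = lam * target + (1 - lam) * G[t]
    return delta_w

def online_update(dr, df, G, dG_dw, omega, lam, gamma):
    """Forward pass: accumulate E_t * delta_t using an eligibility trace E."""
    F = len(dr)
    delta_w, E = 0.0, 0.0
    for t in range(F):
        if t > 0:
            E = lam * gamma * E * df[t - 1]      # decay and propagate the old trace
        E += dG_dw[t] * omega[t]                 # add the current step's contribution
        G_next = G[t + 1] if t + 1 < F else 0.0  # G is zero at the terminal state
        delta = dr[t] + gamma * df[t] * G_next - G[t]
        delta_w += E * delta
    return delta_w

dr = [0.5, -1.0, 0.3, 0.8]
df = [0.9, 1.1, 0.7, 1.2]
G = [0.2, -0.4, 0.6, 0.1]
dG_dw = [1.0, 0.5, -0.3, 2.0]
omega = [1.0, 0.5, 2.0, 1.0]
batch = batch_update(dr, df, G, dG_dw, omega, lam=0.7, gamma=0.95)
online = online_update(dr, df, G, dG_dw, omega, lam=0.7, gamma=0.95)
print(batch, online)
```

In the vector case the scalar products become matrix products, which is where the extra cost of the on-line form comes from.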



Neither of the two implementations in this section attempts to learn the
value gradient at the final time-step of a trajectory, since it is prior knowledge
that the target value gradient is always zero at any terminal state. Hence we
assume the function approximator for $V(\vec{x}, \vec{w})$ has been designed to explicitly
return zero for all terminal states $\vec{x}$.